Let’s load it up!
# Load some packages
library(bayesrules)
library(rstanarm)
## Loading required package: Rcpp
## This is rstanarm version 2.21.1
## - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
## - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
## - For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores())
library(bayesplot)
## This is bayesplot version 1.8.1
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
## * Does _not_ affect other ggplot2 plots
## * See ?bayesplot_theme_set for details on theme setting
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(broom.mixed)
library(tidybayes)
We might want to build a regression model with more than one predictor if multiple predictors are highly associated with the response variable of interest. If we neglect to use more than one predictor when multiple are at play, we will get a worse approximation of the observed data and likely worse predictions of future data, because we will be missing something crucial to how the response variable is influenced.
You want to model a car’s miles per gallon in a city (Y) by the make of the car: Ford, Kia, Subaru, or Toyota. The relationship can be written as: \(\mu = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3\) where the Xs are indicators of whether or not the cars are Kias, Subarus, or Toyotas:
The Ford category is the reference category in this case. That is, when X1, X2, and X3 are all 0, the model describes a Ford, and each indicator coefficient represents the typical difference between that car make and the Ford make.
\(\beta_2\) is the typical difference between the Subaru’s miles per gallon in a city and Ford’s (the baseline) miles per gallon in a city.
\(\beta_0\) is the typical miles per gallon in a city for a Ford car.
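To see how R builds these indicators, here is a small sketch using a hypothetical data frame of car makes; with R’s default treatment coding, Ford (first alphabetically) becomes the reference category:

```r
# Hypothetical sketch: R's default indicator (treatment) coding for car make
cars <- data.frame(make = factor(c("Ford", "Kia", "Subaru", "Toyota")))
model.matrix(~ make, data = cars)
# Each row gets an intercept plus 0/1 indicators for Kia, Subaru, and Toyota;
# the Ford row is 0 in all three indicator columns.
```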
I now grow tomatoes and want to grow big ones. I model tomato’s weight in grams (Y) by the number of days it has been growing (X1) and its type, Mr. Stripey or Roma. X2 = 1 indicates a Roma tomato. The linear function is: \(\mu = \beta_0 + \beta_1X_1 + \beta_2X_2\)
\(\beta_0\) is the typical weight in grams of a hypothetical Mr. Stripey tomato (the reference type, X2 = 0) that has not grown a day
\(\beta_1\) represents the typical change in tomato weight for a one unit increase in days grown for all tomatoes
\(\beta_2\) represents the typical difference in weight (Y, in grams) between Roma and Mr. Stripey tomatoes that have grown for the same number of days
If \(\beta_2\) was zero, this would mean that there is no difference in the typical weight of Roma and Mr. Stripey tomatoes if they have had equal days to grow.
Now we are going to use an interaction term between grow time and type: \(\mu = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2\)
If X1 and X2 are interacting, that means that the type of tomato (Roma or Mr. Stripey) affects the relationship between weight of the tomato and days grown. That is, the slope of the line between weight of tomatoes and the days it has been growing is different depending on the type of tomato.
\(\beta_3\) captures the difference in slopes between the two tomato types for the relationship between weight and days grown. That is, it shows how the typical daily change in weight differs between Roma and Mr. Stripey tomatoes.
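With made-up coefficient values (illustration only, not estimates), the two type-specific lines implied by the interaction model can be computed directly:

```r
# Hypothetical coefficients for the interaction model (illustration only)
b0 <- 10; b1 <- 4; b2 <- -5; b3 <- 1.5
days <- 0:60
mu_stripey <- b0 + b1 * days                 # X2 = 0 (Mr. Stripey)
mu_roma    <- (b0 + b2) + (b1 + b3) * days   # X2 = 1 (Roma)
# The slopes differ by b3: here Roma weight grows 1.5 grams/day faster.
```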
In this model we have Y = amount of beach pollution and X1 = hour of the day. This would benefit from a categorical predictor for time of year: summertime versus not summertime. In the summertime, we might expect that people are at the beach throughout the whole day, so beach pollution might be higher starting earlier. In other seasons, people presumably only go to the beach after work or school and there are likely fewer people, so pollution will peak later in the day and be lower overall.
Yes Interaction
In this model we have Y = number of red lights you’ve encountered and X1 = years you’ve been driving. This would not benefit from an interaction term with the categorical variable female or male. We would not expect any difference in the relationship between how many red lights you encounter and the number of years you’ve been driving if you are male versus if you are female.
No Interaction
First, you can think about context. Does it make sense that there would be an interaction effect among the variables you’ve included? For example, if you are modeling Y = price of your meal with quantitative predictor X1 = time spent at a restaurant and categorical predictor X2 = restaurant type (fancy restaurant, X2 = 1, versus fast food restaurant, X2 = 0), you would likely expect that the relationship between time spent at a restaurant and the price of the meal will differ significantly depending on whether you are at a fancy restaurant or a fast food restaurant.
Second, you can use hypothesis testing to determine whether \(\beta_3\) is plausibly different from zero, for example by checking whether a posterior credible interval for \(\beta_3\) excludes 0.
We’ll model a child’s shoe size (Y) by two predictors: the child’s age (X1) and whether the child knows how to swim (X2)
It can be good to add predictors that better represent the factors contributing to the response variable of interest. Without them, the model could be too simple, and the outcome would be biased and would not actually look like the observed data.
Too many predictors may add unnecessary complexity if they aren’t actually giving us information that helps us model the relationship of interest. More predictors can also overfit the data, making the model really good at precisely predicting the observed data but giving it a lot of trouble predicting any future data.
I might add child’s height to the model. There is a good amount of variation of the size of a child at each age, so adding a predictor that could capture some of this variation in one sense (in the different heights) might be helpful in addition to age to capture the variation in the response variable of child’s shoe size.
I would remove the indicator for the child’s ability to swim. This feels wholly irrelevant to the shoe size of the child. If the ability to swim has any predictive value, I would assume it is correlated with the existing variable of age. That is, I would guess that the older the child is, the more likely they are to know how to swim. So at best the swim indicator gives us slightly redundant data, and at worst it’s irrelevant and adds unnecessary complexity.
Good models will be fair and will be in accordance with some of the basic assumptions we have about independence, linearity, and consistent variance (in the linear models we’ve been working with). They will take context into account and use tests to balance the number of relevant predictors. They will fit the data well enough to produce good predictions, but not so well as to be useless for other data sets. They will pass our diagnostic checks.
Bad models will be unfair and violate some of the basic assumptions of independence, linearity, and consistent variance (in the linear regressions we have been doing). They will not make logical sense in context, using many predictors for no reason or only one when there is much more happening in the data. They will not fit the data well, give poor prediction checks, and fail some of the diagnostic tests we have for models (e.g., MCMC mixing, R-hat, Neff ratio, visual checks).
What techniques have you learned in this chapter to assess and compare models?
We’ve learned three techniques: visualization, cross-validation comparison, and ELPD comparison.
visualization - we can use ppc_intervals or ppc_violin_grouped to look at the posterior predictive models for multiple models. We can see which models capture the observed Y data within the 50% and 95% prediction intervals, and how narrow those intervals are compared to each other. This lets us know how precise the predictions are in comparison to each other
cross-validation - like last chapter, we can use k-fold cross-validation to train the model on subsets of the data and test it multiple times to get more honest posterior predictive estimates. We can then compare the evaluation metrics MAE (median absolute error), scaled MAE, and the within-50 and within-95 posterior predictive interval values. This gives us a sense of the comparative error between the observed data and the different models
ELPD - the expected log-predictive density (ELPD) estimates the expected log posterior predictive pdf across a new set of data points. Here we can calculate the ELPD for each model and compare them. R will rank the models from best to worst, best being the model with the highest ELPD. The output includes both the ELPD difference between models and the standard error of that difference. This allows us to judge whether the differences between models fall within the standard error range of the best model. If they do, the models are not too different with respect to their ELPD scores, and other considerations (complexity, cross-validation, etc.) might break the tie on which model is best.
Briefly explain the bias-variance tradeoff and why it’s important.
The bias-variance trade-off is about balancing the predictors included in our model. We want to include enough predictors to have the relevant information needed to predict Y, but not so many that we overfit our model to the sample data. A model with too few predictors can be too simple, producing low variance but high bias. A model with too many predictors can be tuned too precisely to the data sample, producing low bias but high variance. We want to hit the sweet spot with relatively low bias and low variance.
I will use penguins_bayes data and weakly informative priors and an understanding that the average penguin weighs between 3,500 and 4,500 grams. A predictor of interest is penguin species: Adelie, Chinstrap, or Gentoo.
#alternative penguin data
data("penguins_bayes")
penguin_data <- penguins_bayes |>
filter(species %in% c("Adelie","Gentoo"))
First let’s explore the relationship between body_mass_g with flipper_length_mm and species:
I’ll make two plots, the first for the relationship between body_mass_g and flipper_length, the second between body mass and species:
ggplot(penguin_data,aes(x=flipper_length_mm,y=body_mass_g))+
geom_point()+
labs(title="Flipper Length")
## Warning: Removed 2 rows containing missing values (geom_point).
ggplot(penguin_data,aes(x=species,y=body_mass_g))+
geom_point()+
labs(title="Species")
## Warning: Removed 2 rows containing missing values (geom_point).
These plots show a suspected positive linear relationship between body mass and flipper length. It also shows that Gentoo penguins tend to have a higher body mass than Adelie penguins on average.
We can also look at the second plot in a density plot:
ggplot(penguin_data,aes(x=body_mass_g,fill=species))+
geom_density(alpha=0.5)
## Warning: Removed 2 rows containing non-finite values (stat_density).
This is perhaps a better visualization showing that the body mass of the Gentoo species lies above that of the Adelie species. Here we see an average of about 5000 grams for Gentoo and around 3500 grams for Adelie.
I will use the weakly informative priors for beta_1 and sigma, but use the given information to provide a prior for beta_0. We know that the average penguin weight is between 3500 and 4500 grams, so we will take 4000 as the mean and 250 as the sd (so that 2 sds cover that range):
penguin_model <- stan_glm(body_mass_g ~ flipper_length_mm + species, data = penguin_data, family = gaussian,
prior_intercept = normal(4000,250),
prior = normal(0,2.5,autoscale=TRUE),
prior_aux = exponential(1,autoscale=TRUE),
chains=4,iter=5000*2,seed=369)
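A quick arithmetic check that this prior puts roughly 95% of its mass on the stated 3500 to 4500 gram range:

```r
# Two standard deviations around the prior mean for the intercept
4000 + c(-2, 2) * 250
## 3500 4500
```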
First for the visual checks with trace and chain diagrams:
mcmc_trace(penguin_model)
mcmc_dens_overlay(penguin_model)
These all look good and indicate that they are mixing well.
Then the numerical diagnostics:
neff_ratio(penguin_model)
## (Intercept) flipper_length_mm speciesGentoo sigma
## 0.47010 0.46630 0.47625 0.67745
rhat(penguin_model)
## (Intercept) flipper_length_mm speciesGentoo sigma
## 0.9999334 0.9999363 0.9999264 1.0004294
These both look good, with Neff ratios over 0.1 and R-hat values near 1 and below 1.05. One note: the Neff ratios are lower than in past examples in these HWs, where they were near or even above 1.
tidy(penguin_model,c("fixed","aux"),conf.int=TRUE,conf.level=0.9)
## # A tibble: 5 × 5
## term estimate std.error conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4366. 689. -5513. -3232.
## 2 flipper_length_mm 42.4 3.62 36.5 48.5
## 3 speciesGentoo 219. 110. 37.6 401.
## 4 sigma 393. 16.9 367. 423.
## 5 mean_PPD 4315. 33.7 4260. 4370.
Here we have a beta_1 on flipper_length_mm of 42.4. This means that for every one mm increase in flipper length, body mass typically increases by about 42.4 grams, holding species constant.
We have a beta_2 on speciesGentoo of 218.5. This is the typical difference in body mass between Gentoo and Adelie penguins with the same flipper length. This means that typically, a Gentoo penguin will be 218.5 grams heavier than an Adelie penguin with the same flipper length.
#set the seed and convert the data frame
set.seed(369)
penguin_df <- as.data.frame(penguin_model)
#set prediction notation
predict_Adelie197 <- penguin_df |>
mutate(mu = `(Intercept)` + flipper_length_mm*197 + speciesGentoo*0,
y_new =rnorm(20000,mean=mu,sd=sigma))
#plot the posterior predictive model
ggplot(predict_Adelie197,aes(x=y_new))+
geom_density()+
labs(title="197mm Adelie Penguin Prediction")
##above I wrote this out the long way by hand. Below I use the posterior_predict function we learned this chapter##
#simulate posterior with posterior_predict
prediction <- posterior_predict(
penguin_model,
newdata = data.frame(flipper_length_mm = 197,
species = "Adelie"))
#plot
mcmc_areas(prediction) +
xlab("body_mass_g")
Both the long form and the new function give very similar results. They show a mean prediction of body mass in grams of around 4000 with a spread primarily between 3000 and 5000.
Now we’ll build a model with an interaction term between the two predictors
penguin_model_2 <- stan_glm(body_mass_g ~ flipper_length_mm + species + flipper_length_mm:species, data = penguin_data, family = gaussian,
prior_intercept = normal(4000,250),
prior = normal(0,2.5,autoscale=TRUE),
prior_aux = exponential(1,autoscale=TRUE),
chains=4,iter=5000*2,seed=369)
penguin_data |>
drop_na() |>
add_fitted_draws(penguin_model_2,n=50) |>
ggplot(aes(x=flipper_length_mm,y=body_mass_g,color=species))+
geom_line(aes(y=.value,group=paste(species,.draw)),alpha=0.15)
## Warning: `fitted_draws` and `add_fitted_draws` are deprecated as their names were confusing.
## Use [add_]epred_draws() to get the expectation of the posterior predictive.
## Use [add_]linpred_draws() to get the distribution of the linear predictor.
## For example, you used [add_]fitted_draws(..., scale = "response"), which
## means you most likely want [add_]epred_draws(...).
This plot of the simulated posterior model lines shows a bit of interaction between species and flipper length on body mass. The Adelie penguins are overall lower in flipper length and body mass than the Gentoo. In addition, the slope of most of the Adelie lines seems less steep than that of the Gentoo lines. Moreover, the Gentoo lines seem to have less variation in their slopes than the Adelie lines do.
tidy(penguin_model_2, effects=c("fixed","aux"))
## # A tibble: 6 × 3
## term estimate std.error
## <chr> <dbl> <dbl>
## 1 (Intercept) -2897. 891.
## 2 flipper_length_mm 34.7 4.68
## 3 speciesGentoo -3336. 1333.
## 4 flipper_length_mm:speciesGentoo 17.3 6.48
## 5 sigma 387. 16.5
## 6 mean_PPD 4315. 32.8
Let’s go back to what this is saying. For Adelie birds, the formula is reduced to:
mu = beta_0 + beta_1(flipper_length) + beta_2(Gentoo) + beta_3(flipper_length x Gentoo)
mu = beta_0 + beta_1(flipper_length)
mu = -2897.4 + 34.7(flipper_length)
For the Gentoo bird, the formula is:
mu = (beta_0 + beta_2) + (beta_1 + beta_3)(flipper_length)
mu = (-2897.4 - 3335.5) + (34.7 + 17.3)(flipper_length)
This shows that the slope for Gentoo is steeper than the slope for Adelie.
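As a sketch, these two lines can be reconstructed from the rounded posterior median estimates in the tidy() output above (values approximate):

```r
# Species-specific intercepts and slopes from the posterior medians above
b0 <- -2897.4; b1 <- 34.7; b2 <- -3335.5; b3 <- 17.3
adelie_intercept <- b0          # -2897.4
adelie_slope     <- b1          # 34.7
gentoo_intercept <- b0 + b2     # about -6232.9
gentoo_slope     <- b1 + b3     # about 52.0
```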
With context and the visualization, we can see that there is a difference in slope between Adelie and Gentoo birds in this data set. We can check whether there is substantial posterior evidence that beta_3 != 0:
posterior_interval(penguin_model_2,prob=.90,
pars=c("flipper_length_mm:speciesGentoo"))
## 5% 95%
## flipper_length_mm:speciesGentoo 6.734538 27.82725
The 90% posterior interval for the beta_3 coefficient is entirely above 0, which gives us good evidence that it is most likely positive and non-zero. All of the above argues that there is an interaction effect and that it should probably be taken into account in the model.
Next we’ll explore body mass by three predictors: flipper_length, bill_length, and bill_depth. No interactions
penguin_model_3 <- stan_glm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm, data = penguin_data, family=gaussian,
prior_intercept = normal(4000,250),
prior = normal(0,2.5,autoscale=TRUE),
prior_aux = normal(0,2.5,autoscale=TRUE))
posterior_interval(penguin_model_3,prob=0.95)
## 2.5% 97.5%
## (Intercept) -7241.30804 -5058.01525
## flipper_length_mm 26.15736 38.09245
## bill_length_mm 55.96640 87.66257
## bill_depth_mm 27.98378 80.03287
## sigma 311.50310 370.21540
All of the parameters have CIs well above zero, which leads us to believe that they all have a positive association with body mass.
Consider 4 separate models of body_mass_g
body_1 <- stan_glm(body_mass_g ~ flipper_length_mm, data = penguin_data,family=gaussian,
prior_intercept = normal(4000,250),
prior = normal(0,2.5,autoscale=TRUE),
prior_aux = exponential(1,autoscale=TRUE))
body_2 <- stan_glm(body_mass_g ~ species, data = penguin_data,family=gaussian,
prior_intercept = normal(4000,250),
prior = normal(0,2.5,autoscale=TRUE),
prior_aux = exponential(1,autoscale=TRUE))
body_3 <- stan_glm(body_mass_g ~ flipper_length_mm + species, data = penguin_data,family=gaussian,
prior_intercept = normal(4000,250),
prior = normal(0,2.5,autoscale=TRUE),
prior_aux = exponential(1,autoscale=TRUE))
body_4 <- stan_glm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm, data = penguin_data,family=gaussian,
prior_intercept = normal(4000,250),
prior = normal(0,2.5,autoscale=TRUE),
prior_aux = exponential(1,autoscale=TRUE))
pp_check(body_1) + labs(title="flipper")
pp_check(body_2) + labs(title="species")
pp_check(body_3) + labs(title="flipper + species")
pp_check(body_4) + labs(title="flipper + bill length + bill depth")
All of these plots seem to cover the general center and spread of the observed data. The flipper chart seems to show some bimodality, but is missing the middle ‘hump’ that the observed data shows.
The species data is bimodal as expected given what we know about the difference in mass of the two species, but really misses the middle hump in the data.
The third plot with the species and flipper seems like a few simulated lines hit the middle hump, but still overall looks bimodal.
The final plot looks like it comes closest to the shape of the observed data, though the lines look a bit muddled at the top of the shape, making it a bit hard to see.
penguins_complete <- penguins_bayes |>
select(flipper_length_mm,body_mass_g,species,bill_length_mm,bill_depth_mm) |>
na.omit()
Then we can run the cross-validation:
set.seed(369)
p1 <- prediction_summary_cv(model = body_1, data=penguins_complete,k=10)
p2 <- prediction_summary_cv(model=body_2,data=penguins_complete,k=10)
p3 <- prediction_summary_cv(model=body_3,data=penguins_complete,k=10)
p4 <- prediction_summary_cv(model=body_4,data=penguins_complete,k=10)
Let’s see it all together:
cross_body <- rbind(p1$cv,p2$cv,p3$cv,p4$cv) |>
mutate(model = row_number())
cross_body
## mae mae_scaled within_50 within_95 model
## 1 261.8418 0.6635307 0.5205882 0.9444538 1
## 2 345.3036 0.7380430 0.4708403 0.9589916 2
## 3 258.3965 0.6803209 0.5026050 0.9502521 3
## 4 266.4455 0.6677870 0.4945378 0.9388235 4
Here we see that model 2 has the highest MAE and scaled MAE values, meaning that it fits the data the least well. The other three models have fairly comparable MAE values, and overall the scaled MAE values are very similar. The within posterior predictive interval values are also pretty comparable across the models. For within-50, model 1 is the highest (52%) and model 2 is the lowest (47%). For within-95, model 2 is the highest (barely) at 96% and model 4 is the lowest at 94%.
#calculate the ELPD for the 4 models
loo_1 <- loo(body_1)
loo_2 <- loo(body_2)
loo_3 <- loo(body_3)
loo_4 <- loo(body_4)
#look at the results
c(loo_1$estimates[1],loo_2$estimates[1],loo_3$estimates[1],loo_4$estimates[1])
## [1] -2028.370 -2081.952 -2027.384 -1987.456
Comparing the four models’ ELPD:
loo_compare(loo_1,loo_2,loo_3,loo_4)
## elpd_diff se_diff
## body_4 0.0 0.0
## body_3 -39.9 7.4
## body_1 -40.9 7.1
## body_2 -94.5 11.5
Using loo_compare, this shows that the highest ELPD belongs to model 4 and the lowest to model 2, which checks out with the cross-validation findings. There is a pretty big drop-off between the best model (model 4) and the next best model (model 3). In standard error terms, the difference between model 3 and model 4 (-39.9) is more than 5 standard errors (7.4) below zero, so the gap is well outside the margin of error.
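A quick sanity check on that claim, expressing the model 3 vs. model 4 ELPD gap in standard-error units:

```r
# ELPD difference between model 3 and model 4, in standard errors
-39.9 / 7.4
## about -5.4, i.e. more than 5 standard errors below zero
```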
Given the cross-validation and the ELPD evaluations, I would say that model 4 is the best. Though it is the most complex, the evaluations seem to favor this model. In the 10-fold cross-validation evaluation, the MAE and especially the scaled MAE values are very comparable across models. The ELPD evaluations, however, favor model 4, and the next best model, model 3, is more than 5 standard errors below model 4’s ELPD.